[57] ABSTRACT

The present invention is a method and apparatus for document clustering-based browsing of a corpus of documents, and more particularly to the use of overlapping clusters to improve recall. The present invention is directed to improving the performance of information access methods and apparatus through the use of non-disjoint (overlapped) clustering operations. The present invention is further described in terms of two possible methods for expanding document clusters so as to achieve the overlap, and a method for increasing precision through the use of the overlapped clusters.

11 Claims, 8 Drawing Sheets 
METHOD AND APPARATUS FOR INFORMATION ACCESS EMPLOYING OVERLAPPING CLUSTERS

This invention relates generally to a document clustering-based searching or browsing procedure for a corpus of documents, and more particularly to the use of overlapping clusters.

BACKGROUND AND SUMMARY OF THE INVENTION

Methods of searching or browsing a corpus of documents that involve repeated choice between a number of alternatives (each set of possible choices contained within that alternative selected at the previous stage of choice) suffer from a common difficulty. Once an incorrect choice is made, there is no way to recover. Wrong choices will be likely when the documents being sought lie close to a boundary between one choice and another. The appropriate remedy to this problem is to arrange for decisions regarding the selection of choices not to be incorrect.

Since the choice is among bundles of documents, it is both convenient and suitable to refer to those bundles as clusters. Each cluster in the first stage of choices will be comprised of the documents belonging to the set of second stage clusters that correspond to it. And each second stage cluster will comprise corresponding third stage clusters, the subdivision continuing until the nth stage clusters are small enough to allow attention to individual documents.

Such procedures of stagewise choice have been used most frequently when access is based, interactively, on the user's judgment. Limitations of display methods and/or the user's short-term memory make it infeasible to go at once to the many last-stage clusters. The difficulty arising from mistaken choices when what is sought falls near a division between clusters is often addressed by allowing the user to choose two or more clusters in indecisive situations. This leads to the proliferation of paths unless, as illustrated by the scatter-gather method taught in U.S. Pat. No. 5,442,778 to Pedersen et al., the clustering is always done "on the fly" at each stage of choice. This ameliorates the difficulty near the margins, but enforces an increase in the number of stages because of repeated doublings. The present invention attacks the previously noted difficulty more efficiently by planning for overlap at the margins, so that every cluster is moderately larger than a cluster from a corresponding set of disjoint (i.e., non-overlapping) clusters would be.

Stage-by-stage choice has not been commonly used in search methods that rely on a noninteractive specification of a query which is compared with whatever clusters are relevant. The costs due to taking account of the marginality problem have outweighed the reduced computational load that would be associated with a stage-by-stage approach. As a result, query-based systems usually rely on comparisons of the query with either every smallest cluster or even, most extremely, with each document. Either clearly avoids the marginality problem but at the cost of much more extensive computation. Here again overlapping clusters, where marginal cases belong to two or more clusters at each specific stage, can reduce the marginality problem, while preserving most of the computational savings.

Accordingly, the present invention is directed to improving the performance of information access methods and apparatus as the result of the use of non-disjoint (overlapping) clustering operations. Document clustering has been extensively investigated for improving document search and retrieval methods. In general, clustering relies on the fact that mutually similar documents will tend to be relevant to the same queries; hence, automatic determination of clusters (sets) of such documents can improve recall by effectively broadening a search request. Typically a fixed corpus of documents is clustered either into an exhaustive partition, disjoint or otherwise, or into a hierarchical tree structure. In the case of a partition, queries are matched against clusters, and the contents of some number of the best scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result.

Hybrid strategies are also available, which are essentially variations of near-neighbor searching, where nearness is defined in terms of a pairwise document similarity measure. Cluster search techniques are compared to similarity search, a direct near-neighbor search, and are evaluated in terms of precision and recall, as described by G. Salton and M. J. McGill in "Introduction to Modern Information Retrieval," McGraw-Hill, 1983. Also noted is G. Salton's "Automatic Text Processing," Addison-Wesley, 1989.

In order to cluster documents, it is necessary to first establish a pairwise measure of document similarity and then a method for using that measure to form sets of documents, or clusters. Numerous document similarity measures have been proposed, all of which consider the degree of word overlap between the two documents of interest, described as sets of words, often with frequency information. These sets are typically represented as sparse vectors of length equal to the number of unique words (or types) in the corpus. If a word occurs in a document, its location in this vector is occupied by some positive value (one if only presence/absence information is considered, or some function of its frequency within that document if frequency is considered). If a word does not occur in a document, its location in this vector is occupied by zero. A popular similarity measure, the cosine measure, determines the cosine of the angle between two sparse vectors. If both document vectors are normalized to unit length, this is, of course, simply the inner product of the two vectors. Other measures include the Dice and Jaccard coefficients, which are normalized word overlap counts. It is suggested that the choice of similarity measure has less qualitative impact on clustering results than the choice of clustering procedure. Accordingly, the present invention focuses on the method by which clusters are generated and does not rely on a particular similarity measure. Words are often replaced by terms, in which gentle stemming has combined words differing only by simple suffixes, and words on a stop list are omitted.

Standard hierarchical document clustering techniques employ a document similarity measure and consider the similarities of all pairs of documents in a given corpus. Typically, the most similar pair is fused and the process iterated, after suitably extending the similarity measure to operate on agglomerations of documents as well as individual documents. The final output is a binary tree structure that records the nested sequence of pairwise joins. Traditionally, the resulting trees had been used to improve the efficiency of standard Boolean or relevance searches by grouping together similar documents for rapid access. The resulting trees have also led to the notion of cluster search, in which a query is matched directly against nodes in the cluster tree and the best matching subtree is returned. Counting all pairs, the cost of constructing the cluster trees
can be no less than proportional to N², where N is the number of documents in the corpus. Although cluster searching has shown some promising results, the method tends to favor the most computationally expensive similarity measures and seldom yields greatly increased performance over other standard methods.

One stage methods are intrinsically quadratic in the number of documents to be clustered, because all pairs of similarities must be considered. This sharply limits their usefulness, even given procedures that attain this theoretical upper bound on performance. Partitional strategies (those that strive for a flat decomposition of the collection into sets of documents rather than a hierarchy of nested partitions) by contrast are typically rectangular in the size of the partition and the number of documents to be clustered. Generally, these procedures proceed by choosing, in some manner, a number of seeds equal to the desired size (number of sets) of the final partition. Each document in the collection is then assigned to the closest seed. As a refinement the procedure can be iterated with, at each stage, a hopefully improved selection of cluster seeds. However, to be useful for cluster search the partition must be fairly fine, since it is desirable for each set to only contain a few documents. For example, a partition can be generated whose size is related to the number of unique words in the document collection. Accordingly, the potential benefits of a partitional strategy are largely obviated by the large size (relative to the number of documents) of the required partition. For this reason partitional strategies have not been aggressively pursued by the information retrieval community.

The standard cluster search presumes a query, the user's expression of an information need. The task is then to search the collection of documents that are identified as matching this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query, or where the results of the query are voluminous. One merely has to consider an exemplary search on the Internet, and the potential for voluminous results, to gain an immediate appreciation for clustering-browsing functionality. As another example, the user may not be familiar with the vocabulary appropriate for describing a topic of interest or may not wish to commit to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to gain an appreciation for the general information content of the collection. It seems appropriate to describe this as browsing, since it is at one extreme of a spectrum of possible information access situations, including open-ended questions with a variety of possible answers.

In proposing an alternative application for clustering in information access the present invention is based upon methods typically provided with a conventional text book. If one has a specific question in mind, and specific terms which define that question, one consults an index, which directs one to passages of interest keyed by search words. However, if one is simply interested in gaining an overview, one can turn to the table of contents, which lays out the logical structure of the text for perusal. The table of contents gives one a sense of the types of questions that might be answered if a more intensive examination of the text were attempted, and may also lead to specific sections of interest. One can easily alternate between browsing the table of contents, and searching the index or, more importantly, an iterative combination of both.

Heretofore, publications have disclosed clustering techniques, the relevant portions of which may be briefly summarized as follows:

U.S. Pat. No. 5,442,778 to Pedersen et al., issued Aug. 15, 1995, for a "Scatter-Gather: A Cluster-Based Method and Apparatus for Browsing Large Document Collections." Pedersen et al., hereby incorporated by reference for its teachings, discloses a document clustering-based browsing procedure for a corpus of documents. The methods described for partitional clustering include a Buckshot method and a Fractionation method, both of which may be employed to produce input for a cluster digest method for determining a summary of the ordering of a corpus of documents in the Scatter-Gather technique.

"Recent trends in hierarchic document clustering: A critical review" by Peter Willett, Information Processing & Management, Vol. 24, No. 5, pages 577-97 (1988, printed in Great Britain), describes the calculation of interdocument similarities and clustering methods that are appropriate for document clustering.

"Understanding Multi-Articled Documents" by Tsujimoto et al., presented in June 1990 in Atlantic City, NJ, at the International Conference for Pattern Recognition, describes an attempt to build a method to understand document layouts without the assistance of character recognition results, i.e., the meaning of contents.

P. Willett, in "Document Clustering Using an Inverted File Approach," Journal of Information Science, Vol. 2 (1980), pp. 223-31, teaches a method for generating overlapping document clusters.

As will be appreciated, various information access techniques use subdivision of the initial corpus, or one of its subcorpora, into clusters, often with the purpose of seeking the user's aid in selecting one or more clusters to serve as a subcorpus for a subsequent iterative stage. Conventionally, these clusters are selected so that (a) their union covers the whole of the initial corpus, and (b) the individual clusters are disjoint (non-overlapping). Unfortunately, disjoint clusters have practical disadvantages when the document that is sought falls near, or even across, a cluster boundary, so that at least two parallel clusters must be selected to avoid the loss of the document. The present invention, however, avoids the need for such parallelism and allows the user access to clusters that overlap so as to make choosing a single cluster both natural and efficient.

In accordance with the present invention, there is provided a method, operating in a digital computer, for searching a corpus of documents, comprising the steps of: preparing an initial structuring of the corpus into a plurality of overlapping clusters, wherein at least two of the plurality of overlapping clusters contain a single document; and determining a summary of the plurality of clusters prepared by said initial structuring of the corpus.

In accordance with another aspect of the present invention, there is provided a document browsing system for use with a corpus of documents stored in a computer system, the document browsing system comprising: program memory for storing executable program code therein; a processor, operating in response to the executable program stored in said program memory, for automatically preparing a structuring of the corpus of documents into a plurality of document clusters, wherein at least two of the plurality of document clusters overlap and contain at least one common document therebetween; data memory for storing data identifying the documents associated with each of the plurality of document clusters; said processor summarizing the plurality of document clusters and generating summary data for said document clusters; and a user interface for displaying the summary data.

To provide the flexibility required to deal with nonspecific user's requirements, a browsing system usually requires
means for broadening the working corpus as well as narrowing it. This invention preferably concerns the narrowing aspect and its description assumes tacitly the existence of broadening operations.

In accordance with yet another aspect of the present invention, there is provided a document search and retrieval method operating in a digital computer, for searching a corpus of documents, comprising the steps of: identifying, in response to at least one search term, a sub-corpus of documents containing the at least one user specified search term; preparing an initial structuring of the sub-corpus into a plurality of overlapping clusters, wherein at least two of the plurality of overlapping clusters contain a single document; and determining a summary of the plurality of overlapping clusters prepared by said initial structuring of the sub-corpus.

In accordance with a further aspect of the present invention there is provided a document searching system for use with a corpus of documents stored in a computer system, the document searching system comprising: program memory for storing executable program code therein; a processor, operating in response to the executable program stored in said program memory, for automatically preparing a structuring of the corpus of documents into a plurality of document clusters, wherein at least two of the plurality of document clusters overlap and contain at least one common document therebetween; data memory for storing data identifying the documents associated with each of the plurality of document clusters; memory access means for accessing the data memory and said processor summarizing the plurality of document clusters and generating summary data for said document clusters; and a user interface for displaying the summary data.

In accordance with yet another aspect of the present invention, there is provided a method operating in a digital computer, for searching a corpus of documents, comprising the steps of: subdividing a corpus of documents into a hierarchical structure containing a plurality of levels of clusters, wherein at least two of the clusters on a particular level are overlapping clusters containing at least a single document in common; selecting, from the hierarchical structure, a plurality of clusters to form a subcorpus, wherein the subcorpus contains fewer documents than the corpus; and identifying, in response to a search query, those documents in the subcorpus providing a positive response to the search query.

One aspect of the invention is based on the observation of problems with conventional document search and retrieval techniques (disjoint clustering) where a user can select only one cluster in order to obtain a particular document. This aspect is based on the discovery of a technique that alleviates these problems by allowing documents within the corpus to be associated with a plurality of clusters, where such a technique would be characterized as having overlapping clusters. This technique can be implemented, for example, by clustering related documents into non-disjoint clusters. Here documents only moderately related to a particular attractor, or cluster vector, will also be moderately related to another attractor and will, therefore, be associated with both attractors (overlap). Thus, it is believed that this aspect of the invention not only favors recall, but may ultimately favor precision as well. Precision is favored because the present invention allows the user to initially review a broader range of documents and to subsequently focus on documents belonging only to a single inner cluster and to no other clusters.

A processor or computing machine implementing the invention can include a monitor or display to assist the user in the visualization of the clustering operation so as to allow "browsing" of the corpus in an orderly fashion. Such a display preferably shows the results of a query in a clustered format to enable the user to iteratively review documents within a corpus that relate to a desired topic.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of the hardware components used to practice the present invention;

FIG. 2 is a high level flowchart of a preferred embodiment of the document browsing method according to the present invention;

FIGS. 3 and 4 are flowcharts illustrating, in accordance with the present invention, details of the structuring operations depicted in FIG. 2;

FIG. 5 is an illustrative diagram depicting a preferred embodiment of the document browsing method of the present invention as applied to a corpus of documents retrieved by a search;

FIGS. 6, 7 and 8 are exemplary user interface screens displayed in accordance with the browsing operations depicted in FIG. 5; and

FIG. 9 is a flowchart illustrating the additional steps necessary to identify an inner cluster in accordance with the present invention.

The present invention will be described in connection with a preferred embodiment (e.g., iterative structuring of a document corpus); however, it will be understood that there is no intent to limit the invention to the embodiment described. On the contrary, the intent is to cover all alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims.

For a general understanding of the present invention, reference is made to the drawings. In the drawings, like reference numerals have been used throughout to designate identical elements. In describing the present invention, the following term(s) have been used in the description.

A "document" as used herein refers to a digital medium of communication, including, but not limited to: bitmap image representations of hardcopy materials, electronically composed pages (e.g., ASCII or PDL formats such as Interpress® and PostScript®), e-mail or similarly transmitted messages, and equivalent manifestations of digital information. Documents may also contain images, text, graphics, sound media clips and other elements therein. Furthermore, the present invention is intended for use with any type of document for which a similarity metric is determinable. The term "corpus" refers to a collection or set of documents. A "corpus" may be used to represent an entire collection of materials, or it may be used to represent a "sub-corpus" of a larger collection that is used as the input to the structuring methods described herein. As used herein to describe the results of searches conducted on a corpus, "precision" is the ratio of relevant documents retrieved to the total number of documents retrieved. "Recall" is the ratio of relevant documents retrieved to the total number of relevant documents.

A "feature" is an element of a document by which the document can be partially described so as to enable a similarity determination. A document may, for example, be
represented as a one-dimensional array or vector of features, where the array contains an entry for each feature in the corpus to which the document belongs and where the document array has a non-zero entry to indicate that the feature occurs within the document. A feature may be a word, a statistical phrase, an algorithmically rotated coordinate (such as those obtained from singular value decomposition (SVD) analysis of the word by document matrix), or similar unit of understanding into which the document may be divided. SVD is a matrix factorization technique. Basically, a words versus document cooccurrence matrix is factored via SVD and only the highest weighted rotated pseudo-dimensions are retained to achieve a dimensionality reduction. Incorporated herein by reference is a publication describing the technique by Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Vol. 41, No. 6 (1990), pp. 391-407, and U.S. Pat. No. 4,839,853 to Deerwester et al., issued Jun. 13, 1989 for Computer Information Retrieval Using Latent Semantic Structure. The patent teaches a methodology for retrieving textual data objects, presuming that there is an underlying latent semantic structure in the usage of words in data objects. Estimates of this latent structure are utilized to represent and retrieve objects while the user query is recouched in a statistical domain and processed to extract the underlying meaning to respond to the query.

The term "browsing" is used herein to describe the act of interactively choosing among clusters of one or more of a plurality of documents at successive stages. The standard formulation of a cluster search presumes a "query," a user's expression of an information need. The task is then to search the corpus of documents for those documents meeting this need. However, the user may not wish or be able to furnish a query, be familiar with the vocabulary appropriate for describing a topic of interest, or may not wish to commit to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to gain an appreciation for the general information content of the collection. This "search" is, therefore, characterized as browsing since it is not associated with a formal query; it is an open-ended process with a variety of possible results. Accordingly, the present invention facilitates the open-ended nature of the search spectrum by allowing a user to "browse" a corpus of documents that has been structured or divided into a plurality of related clusters.

The term "data" refers to physical signals that indicate or include information. When an item of data can indicate one of a number of possible alternatives, the item of data has one of a number of "values." The term "data" includes data existing in any physical form, and includes data that are transitory or are being stored or transmitted. For example, data could exist as electromagnetic or other transmitted signals or as signals stored in electronic, magnetic, or other form.

A "data storage medium" or "storage medium" is a physical medium that can store data. Examples of data storage media include magnetic media such as diskettes, floppy disks, and tape; optical media such as laser disks and CD-ROMs; and semiconductor media such as semiconductor ROMs and RAMs. As used herein, "storage medium" covers one or more distinct units of a medium that together store a body of data. For example, a set of floppy disks storing a single body of data would together be a storage medium.

"Memory circuitry" or "memory" is any circuitry that can store data, and may include local and remote memory and input/output devices. Examples include semiconductor ROMs, RAMs, and storage medium access devices with data storage media that they can access. A "memory cell" is memory circuitry that can store a single unit of data, such as a bit or other n-ary digit or an analog value.

A "data processing system" is a physical system that processes data. A "data processor" or "processor" is any component or system that can process data, and may include one or more central processing units or other processing components. A processor performs an operation or a function "automatically" when it performs the operation or function independent of concurrent human control. Typically such operations are executed in response to a series of code instructions stored in memory accessible by the processor. "Code" means data indicating instructions, but in a form that the processor can execute.

"User input circuitry" is circuitry for providing signals based on actions of a user. User input circuitry can receive signals from one or more "user input devices" that provide signals based on actions of a user, such as a keyboard or a mouse. The set of signals provided by user input circuitry can therefore include data indicating mouse operation and data indicating keyboard operation. Signals from user input circuitry may include a "request" for an operation, in which case a system may perform the requested operation in response.

A "hierarchical structure" is a structure that is perceptible as having a number of levels. A hierarchical node-link structure, for example, could have a number of levels of nodes, with links connecting each node on a lower level to one of the nodes on an upper level. A common characteristic of display systems is a mapping between items of data within the system and display features presented by the system. A structure "represents" a body of data when display features of the structure map one-to-one with the items of data in the body of data. For example, each node of a hierarchical node-link structure could represent a node in a tree of data or in another hierarchy of data items such as a directed graph that is organized into levels. The links of the structure can represent relationships between items of data, so that the links in a hierarchical node-link structure can represent hierarchical relationships such as parent-child relationships.

A "selectable unit" is a display feature that is perceived as a bounded display area that can be selected, for example, button 164 in FIG. 6. The term "select," when used in relation to a selectable unit, means a user input operation that includes a signal that uniquely indicates the selectable unit. In general, an action by a user "indicates" a thing, an event, or a characteristic when the action demonstrates or points out the thing, event or characteristic in a manner that is distinguishable from actions that do not indicate the thing, event, or characteristic. The user can, for example, use a pointing device such as a mouse to select a selectable unit by indicating its position and clicking a button on the pointing device. In general, a selectable unit may take any appearance, and is not limited to a visually distinguishable feature or set of features that appears to be a coherent unity.

An "image input device" is a device that can receive an image and provide an item of data defining a version of the image. A "scanner" is an image input device that receives an image by a scanning operation, such as by scanning a document.

An "image output terminal" (IOT) is a device that can receive data defining an image and provide the image as output, for example, in printed form. A "display" is an image
output device that provides the output image in human
viewable form. The visible pattern presented by a display is
a "displayed image" or simply "image."

Referring to FIG. 1, the present invention can be implemented
in a document corpus browsing system 12. The system includes
a central processing unit 14 (processor) for receiving signals
from, and outputting signals to, various other components of
the system, according to one or more programs run on central
processing unit 14. The system includes a read only memory
(ROM) 16 for storing operating programs in the form of
executable code. A random access memory (RAM) 18 is also
provided for running the various operating programs, and
additional files on memory storage device 20 could be provided
for overflow and the storage of structured corpora used by the
present invention in performing a search operation.

Prior to performing a browsing operation, a document corpus is
input from a corpus input 22. Corpus input 22 includes on-line
search databases and Internet access capability so that the
corpora may include an entire document database, or a subset
thereof identified in response to a query. The corpus is then
structured by central processing unit 14 in response to
software code executed according to the teachings of the
present invention.

Display 24 is provided for displaying results of structuring
procedures, and for permitting the user to interface with the
operating programs. A user input device 26 including, but not
limited to, a mouse, a keyboard, a touch screen or combinations
thereof is provided for input of commands and selections made
by the user. An IOT 28 can also be provided so that documents,
as well as printouts containing cluster digest summaries, may
be rendered in hardcopy form.

The system 12 is preferably based in a digital computer that
can implement an off-line preparation of an initial structuring
using the non-disjoint (overlapping) clusters technique for the
reasons discussed above. The system 12 also determines a
summary of the corpus which can be presented to the user via
display 24 or printer 28 for user interaction. After receiving
appropriate instructions from a user via user input device 26,
system 12 can perform a further structuring of the corpus,
again using the non-disjoint clustering technique described
herein.

The browsing technique upon which the present invention is
based can be demonstrated in more detail through reference to
the general steps illustrated in FIG. 2. Subsequently, an
example will be described showing the detailed steps and
associated output data (output via a display or printer).
Initially, step 50 prepares an initial structuring of the
corpus, or subcorpus identified by a query. Once structured
into a predefined number of overlapping clusters, a summary of
each cluster is prepared at step 52. The summary is preferably
a list of features, extracted from documents assigned to each
cluster, representing the primary features for which the
similarity clusters are grouped (e.g., the most frequently
occurring features). The structuring, both initial and
subsequent, is accomplished in accordance with the specific
steps illustrated in FIG. 3. Once the subcorpus has been
structured and summarized, it is then possible for a user to
select a cluster, step 54, and to cause the system to further
structure the selected cluster into the predefined number of
clusters, step 56, once again with some or all of the clusters
being overlapping (e.g., containing common documents).
Although it would be possible for the user to select specific
documents within the clusters at this point, it will be
appreciated that further iterations of the
structure-summarize-select steps may be conducted to further
reduce the number of documents within individual clusters.

Referring now to FIG. 3, shown therein are more detailed steps
of the structuring process employed by the present invention,
as previously represented by steps 50 and 56 of FIG. 2. As will
be understood by those skilled in the art, clusters are often
defined by a set of attractors, each essentially a vector that
summarizes the vectors of each document belonging to the
cluster, e.g., a centroid of those vectors. As used in
referring to a relationship between a vector and an attractor,
the "closeness" is frequently evaluated in terms of the cosine
of the angle between the vectors. The cosine of the angle may
be represented as:

$$\sum_i a_i d_i \qquad (1)$$

when the attractor (vector) is $\{a_i\}$ and the document
vector is $\{d_i\}$, where

$$\sum_i a_i^2 = \sum_i d_i^2 = 1.$$

As illustrated in FIG. 3, initially a document is selected
(step 72) from the corpus or subcorpus, depending upon the
input, and a document vector $\{d_i\}$ associated with the
document is created (step 74). This process is repeated for
each document (test step 76) until each document is
represented by a vector stored in memory. In a preferred
embodiment, the vector memory is comprised of a single
one-dimensional array for each document. After the document
vectors are stored they are then analyzed (step 78), preferably
using one of the analysis methods described by Pedersen et al.
in U.S. Pat. No. 5,442,778. The analysis, in conjunction with
the user input of the required number of cluster structures
(step 80), results in a preliminary set of clusters. However,
the present invention further modifies these clusters as is
reflected by the structuring step 82. Specifically, structuring
step 82 first identifies the specified number of attractors
from the analyzed vectors (step 84) and then assigns each
document to a particular cluster based upon its "closeness" to
the cluster's attractor as, for example, described above.
However, the present invention further modifies the closeness
criterion, which would otherwise result in disjoint
(non-overlapping) clusters. The modification of the clustering
is accomplished by adding to each cluster at least one document
found in another cluster. As illustrated by step 88, the number
of additional documents is determined as a fraction of the
number of documents in each cluster. For example, if a cluster
contained 50 documents after the disjoint clustering steps and
a predefined fraction of ½ was defined, then the twenty-five
next closest document vectors would identify twenty-five
additional documents to be added to the cluster.

In the alternative embodiment illustrated in FIG. 4, the
document addition rule of step 88 is replaced by step 91 in
structuring operation 82, so that all documents having a cosine
with the attractor falling within a range ($\Delta$) of the
maximum cosine between the document and any other attractor are
added to the cluster. Expressed mathematically, documents added
to the cluster are those where:

$$\sum_i a_i d_i \ge \max_{\{a'_i\}} \sum_i a'_i d_i - \Delta \qquad (2)$$

where the range of $\Delta$ is preferably from about 0.05 to
about 0.10.

To further illustrate the effect of the modified clustering
rules represented in steps 88 (FIG. 3) and 92 (FIG. 4), an
exemplary search-structure-summary-structure operation (FIG. 2)
will be described with respect to FIGS. 5-8. Referring
initially to FIG. 5, an initial corpus 100 is searched using a
preliminary search 102 that is defined by a series of words 103
forming a query to produce a subcorpus 104. The
subcorpus is defined for this example as documents identified
in response to the query related to the topic "criminal actions
against officers of failed financial institutions," and the
actual search terms 103 included "bank, financial institution,
failed, criminal, officer, indictment" and the goal is to find
documents related to the topic. To create the initial structure
or partition, step 50 of FIG. 2, the disjoint clustering method
of FIG. 3 is first applied to the subcorpus. The output of
structuring step 50 is a set of five document clusters each
having a primary topic, for example, clusters 112-120 as shown
in group 110 of FIG. 5. It will be appreciated that it is
possible to employ alternative numbers of clusters and the
number may be specified by the user via the user interface. In
the example, a user is presumed to have input "5" as the
desired number of clusters. Subsequently, the user preferably
selects only one of the clusters in group 118 based upon
summary data presented as shown in FIG. 6. The output of the
clustering operation being a list for each cluster, as
indicated by step 90 of FIG. 3, it will be appreciated that
such a list will preferably be stored in memory as a
hierarchical structure so as to be recallable and formattable
using any of a number of well-known user interface display
techniques. Moreover, the cluster list data may also be
recorded in a database or record format in RAM so as to be
displayable/accessible using well-known database software.
Accordingly, there is no intent to limit the present invention
to the display examples that will be discussed below.

Turning to FIG. 6, displayed therein is an exemplary user
interface screen 150 that displays a window 151a having five
cluster windows 152, 154, 156, 158 and 160 contained therein.
In each of the windows, there is displayed a banner 162 that
contains, in left to right order, a selection button 164, a
cluster number field 166, and a cluster size field 168. Also
included in the banner 162 is a terms field 170 that contains
one or more terms representative of the features identified as
most prevalent during the similarity analysis or, in other
words, the features most descriptive of the documents that have
been structured into the cluster. As is seen in FIG. 6, the
five clusters have various numbers of documents grouped
therein. The documents are displayed in lists, each entry in
the list containing in left-to-right order a document
identifier, document title, and additional information (e.g.,
the name of the writer). In addition, the scrolling bar 172 to
the right side of each window allows a user to scroll through
each cluster's list using up and down arrow keys as is well
known in windows-based user interfaces.

Continuing with FIG. 6, the result of the overlapping
clustering operation is apparent if one looks closely at the
listings in clusters 1 and 3 (reference numerals 152 and 156,
respectively). Document number 334160 appears in both clusters.
Thus, clusters 1 and 3 overlap by at least a single document.
Moreover, if a user wished to further structure either of the
two clusters, document 334160 would be retained and
restructured regardless of whether cluster 1 or cluster 3 were
selected via user interface window 151a.

Referring again to FIG. 5 in conjunction with FIG. 6, a user
may choose to further review cluster 3. However, the cluster
contains 217 entries and it may be desirable to further
structure the cluster. As indicated in FIG. 6, the user's
selection of a selectable unit, button 174 (shown darkened to
indicate selection), triggers a second structuring of the
subcorpus now defined by the 217 documents in cluster 3. Here
again, the structuring (FIG. 2, step 56) is accomplished using
one of the two possible overlapping cluster methods represented
by the detailed steps of FIGS. 3 or 4. The results of the
restructuring are illustrated in FIG. 5 as a group of
clusters 122, again including a total of five clusters numbered
124, 126, 128, 130, and 132.

Turning now to FIG. 7, groups 122 of FIG. 5 are illustrated in
the user interface window 151b, again as a series of five
windows numbered 180, 182, 184, 186 and 188. As previously
described with respect to FIG. 6, each of the windows contains
a banner and a list section and associated buttons and
information therein. As is apparent, there is more overlap
present in the second iteration of the overlapping clusters as
shown in FIG. 7. In particular, one will note that if a total
of all documents appearing in each cluster is taken, the total
adds up to a number greater than the 217 documents indicated to
be in the subcorpus associated with cluster 3 in FIG. 6 (groups
116 of FIG. 5). In fact the total is approximately 25% greater,
indicating that each of the clusters has approximately up to a
25% overlap of documents with one or more of the other
clusters. Again, as an example of overlapping documents, it can
be seen that cluster windows 184 and 186 both list document
366321 therein. It will be appreciated that the lists may be
ordered in the cluster windows by descending order of their
similarity to the cluster attractor (principal feature terms
for which are listed in the terms field 170). Accordingly,
those documents that overlap are likely to be lower in the
lists as they are "added" to the clusters by extending the
initial structures as previously described in detail with
respect to the structuring operation of FIGS. 3 and 4.

Referring briefly to FIG. 8, it is also possible for the user
to select a particular cluster at this level (e.g., cluster 3
of FIG. 7) and to expand the window displaying that cluster as
indicated by window 151c. It will be appreciated that a user
may further select one or more documents from the enlarged view
(or also from the smaller window views of FIGS. 6 and 7) so
that the document may be retrieved, printed or otherwise
selected for later use. Such a selection process is facilitated
by the use of selection buttons 190 appearing along the left of
window 151c, although selection may also be enabled by
placement of a pointing device on the document line and
indicating a selection. Those documents selected are preferably
indicated by a darkened selection button, as indicated by
button 192, by highlighting the document title or by reversing
the text foreground and background colors of the document
title.

As is illustrated by the above example, once a family of
overlapping clusters has been established, a choice between
favoring recall and favoring precision may be implemented for a
user. Recall is favored by using the overlapping clusters
themselves. In fact, the overlaps are deliberately included to
promote recall (admittedly at the expense of precision).
However, if it is desired to emphasize precision, the present
invention makes it possible to further define inner clusters
using the afore-described techniques. In particular, inner
clusters would be comprised of those documents appearing in
only a single cluster after the overlapping cluster operations
are performed as described above. In other words, use of inner
clusters eliminates those documents that are close to the
boundaries of the clusters of corresponding non-overlapping
clusters. The steps associated with the determination of inner
clusters are illustrated in FIG. 9.

Referring to FIG. 9, the data input to the inner cluster
determination process 200 is the cluster lists generated at
step 90 of FIG. 3. At step 202, the overlapping cluster lists
are retrieved from RAM memory where they were stored as
described above. It will be appreciated that the process of
retrieving the list data may temporarily create a copy of the
lists in a second section of RAM. Step 204 then identifies,
from the retrieved list data, those documents that appear in
more than one cluster list. Subsequently, step 206 modifies
the list data stored in the second memory section by elimi- 
nating those documents identified in step 204 from all of the 
cluster lists in which they appear. The resulting output (step 
208), data in the second memory section, then reflects 
clusters having only those documents that appeared in a 
single cluster. This output is then the inner cluster informa- 
tion that is more precise than that which is obtained either 
from a non-disjoint or a disjoint clustering of all documents 
in the input corpus. Alternatively, since the inner clusters 
need not span the next-upward cluster to which they belong, 
intermediate clusters can be defined corresponding to each 
of the overlapping clusters and consisting of all documents, 
in that overlapping cluster, for which the corresponding
attractor is the closest attractor. 
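
The inner and intermediate clusters just described can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names, the dictionary layout of the cluster lists, every document number other than 334160, and the assignment of 334160 to cluster 3 as its closest attractor are all assumptions made for the example.

```python
from collections import Counter


def inner_clusters(cluster_lists):
    """Drop every document that appears in more than one cluster list.

    A new dictionary is returned, so the overlapping lists stay intact,
    loosely mirroring the copy kept in a second memory section.
    """
    counts = Counter(doc for docs in cluster_lists.values() for doc in set(docs))
    return {cid: [d for d in docs if counts[d] == 1]
            for cid, docs in cluster_lists.items()}


def intermediate_clusters(cluster_lists, closest):
    """Keep, per cluster, only documents whose closest attractor is that cluster's."""
    return {cid: [d for d in docs if closest[d] == cid]
            for cid, docs in cluster_lists.items()}


# Hypothetical data: document 334160 is shared by clusters 1 and 3.
overlapping = {1: [334160, 100001, 100002], 3: [334160, 100003]}
# Assumed closest-attractor assignment for each document.
closest_attractor = {334160: 3, 100001: 1, 100002: 1, 100003: 3}

inner = inner_clusters(overlapping)
intermediate = intermediate_clusters(overlapping, closest_attractor)
print(inner)
print(intermediate)
```

In this toy case the shared document is removed from both inner clusters, while the intermediate clusters return it to the one cluster whose attractor it is assumed to be closest to.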

It will be further appreciated that the overlapping cluster
aspect of the present invention may be applied to alternative 
searching or browsing applications. For example, overlap- 
ping clusters may be employed to precompute a hierarchical 
structure for a document corpus. Once precomputed, the 
resulting structure may be employed to improve the com- 
putational efficiency of query-based searches, by effectively 
reducing the number of documents to which the query 
metric must be applied. 

In operation, overlapping clusters would be used to avoid 
the previously described problem of marginality. Preferably,
the process would begin by subdividing a corpus of docu-
ments into the hierarchical structure, each level of the
structure consisting of clusters (nodes) that overlap to a
certain extent with other clusters at the same hierarchical 
level. The structure would extend down to a level in which 
each cluster contained no more than a predefined maximum 
number of documents. Although precomputation of the 
hierarchical structure is perhaps computationally intensive, 
the advantage is that the entire corpus only need be subdi- 
vided once. 
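
The one-time subdivision described above might be organized as a simple recursion. The sketch below is illustrative only: `build_hierarchy`, `leaves`, and the node layout are hypothetical names, and `halves_with_overlap` is a toy stand-in for an overlapping clustering step, not the clustering method of FIG. 3 or FIG. 4.

```python
def build_hierarchy(doc_ids, split, max_docs):
    """Recursively subdivide doc_ids until each leaf holds at most max_docs.

    `split` is any callable returning a list of (possibly overlapping)
    document lists for the next level down.
    """
    node = {"docs": list(doc_ids), "children": []}
    if len(node["docs"]) <= max_docs:
        return node
    for child_docs in split(node["docs"]):
        # Skip a child that fails to shrink the cluster, to guarantee termination.
        if len(child_docs) < len(node["docs"]):
            node["children"].append(build_hierarchy(child_docs, split, max_docs))
    return node


def leaves(node):
    """Collect the document lists of all leaf clusters, left to right."""
    if not node["children"]:
        return [node["docs"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


def halves_with_overlap(docs):
    """Toy stand-in for overlapping clustering: two halves sharing one document."""
    mid = len(docs) // 2
    return [docs[:mid + 1], docs[mid:]]


tree = build_hierarchy(list(range(8)), halves_with_overlap, max_docs=3)
print(leaves(tree))
```

Because neighboring leaves share a document, a document near a boundary remains reachable from either branch, which is the property the overlapping hierarchy is meant to provide.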

Once the subdivision is accomplished, keyword or other 
types of queries may be run against selected clusters or 
branches of the structure. Moreover, the clusters upon which 
such searching is done may be selected manually by a user, 
or automatically based upon summary data, e.g., its attractor, 
for each cluster. Once such a selection is accomplished, a 
subcorpus consisting of all documents in the selected clus- 
ters may be obtained, where the subcorpus contains fewer 
documents than the initial corpus from which it was 
selected. Preferably, the query is then executed against only 
the next-level clusters in this subcorpus to identify those 
documents in the subcorpus providing a positive response to 
the search query. 
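
A rough sketch of running a query against automatically selected clusters follows. It simplifies the description: clusters are ranked by the cosine between a query vector and each attractor, and the keyword test is applied directly to the documents of the resulting subcorpus rather than to next-level clusters. All names, vectors, and texts are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_clusters(query_vec, attractors, top_n):
    """Pick the top_n clusters whose attractors are closest to the query vector."""
    ranked = sorted(attractors,
                    key=lambda cid: cosine(query_vec, attractors[cid]),
                    reverse=True)
    return ranked[:top_n]


def search(query_terms, query_vec, attractors, members, texts, top_n=1):
    """Run a keyword query only against documents in the selected clusters."""
    chosen = select_clusters(query_vec, attractors, top_n)
    subcorpus = sorted({d for cid in chosen for d in members[cid]})
    return [d for d in subcorpus
            if all(term in texts[d].lower() for term in query_terms)]


# Hypothetical two-cluster corpus with one shared document (d2).
attractors = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
members = {"A": ["d1", "d2"], "B": ["d2", "d3"]}
texts = {"d1": "Bank officer indicted",
         "d2": "Failed bank",
         "d3": "Weather report"}

hits = search(["bank"], [0.9, 0.1], attractors, members, texts)
print(hits)
```

Only the two documents of cluster A are tested against the keyword here, which illustrates how the precomputed structure reduces the number of documents to which the query must be applied.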

In recapitulation, the present invention is a method and 
apparatus for document clustering-based browsing of a 
corpus of documents, and more particularly to the use of 
overlapping clusters to improve recall. The present inven- 
tion is directed to improving the performance of information 
access methods and apparatus through the use of non- 
disjoint (overlapped) clustering operations. The present 
invention is further described in terms of two possible 
methods for expanding document clusters so as to achieve 
the overlap, and a method for increasing precision through 
the use of the inner portions of the overlapping clusters. 
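
The two expansion methods recapitulated above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names and the toy vectors are hypothetical, the fraction rule ranks outside documents by their cosine to the cluster's attractor (one reading of "next closest"), and the delta value in the usage lines is chosen for the toy data rather than taken from the preferred range.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_by_fraction(docs, attractors, fraction):
    """Disjoint assignment, then add the next-closest outside documents.

    Each cluster gains int(fraction * size) documents from other clusters.
    """
    cos = {d: {c: cosine(v, a) for c, a in attractors.items()}
           for d, v in docs.items()}
    home = {d: max(cos[d], key=cos[d].get) for d in docs}
    result = {}
    for c in attractors:
        base = [d for d in docs if home[d] == c]
        outsiders = sorted((d for d in docs if home[d] != c),
                           key=lambda d: cos[d][c], reverse=True)
        result[c] = base + outsiders[:int(fraction * len(base))]
    return result


def expand_by_delta(docs, attractors, delta):
    """Add a document to every cluster whose cosine is within delta of its best."""
    result = {c: [] for c in attractors}
    for d, v in docs.items():
        scores = {c: cosine(v, a) for c, a in attractors.items()}
        best = max(scores.values())
        for c, score in scores.items():
            if score >= best - delta:
                result[c].append(d)
    return result


# Hypothetical unit-length document vectors and two attractors.
attractors = {"X": [1.0, 0.0], "Y": [0.0, 1.0]}
docs = {"d1": [1.0, 0.0], "d2": [0.8, 0.6], "d3": [0.6, 0.8], "d4": [0.0, 1.0]}

by_fraction = expand_by_fraction(docs, attractors, 0.5)
by_delta = expand_by_delta(docs, attractors, 0.25)
print(by_fraction)
print(by_delta)
```

With the fraction rule the amount of overlap is fixed in advance by the cluster sizes, whereas with the delta rule it depends on how many documents lie near a boundary between attractors.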

It is, therefore, apparent that there has been provided, in 
accordance with the present invention, a method and appa- 
ratus for improving recall and precision in document brows- 
ing operations. While this invention has been described in 
conjunction with preferred embodiments thereof, it is evi- 
dent that many alternatives, modifications, and variations 
will be apparent to those skilled in the art. Accordingly, it is
intended to embrace all such alternatives, modifications and 
variations that fall within the spirit and broad scope of the 
appended claims. 
What is claimed is:

1. A method, operating in a digital computer, for searching 
a corpus of documents, comprising the steps of: 

preparing an initial structuring of the corpus into a plu- 
rality of primary overlapping clusters, wherein at least 
two of the plurality of primary overlapping clusters
contain a single document, wherein the step of prepar- 
ing an initial structuring of the corpus includes the steps 
of 

(a) identifying an attractor for each of a plurality of 
clusters,

(b) including in each of the plurality of clusters all 
documents for which the cluster produces the closest 

attractor, and

(c) adding to the cluster those additional documents 
numbering a predefined number of documents for

which the cluster provides an attractor whose differ- 
ence in closeness from the closest attractor is mini- 
mized; and 

determining a summary of the plurality of primary over-
lapping clusters prepared by said initial structuring of
the corpus. 

2. The method of claim 1, wherein the predefined number
is determined as a fraction of the number of documents for
which the cluster produces the closest attractor. 

3. The method of claim 2, wherein the fraction of the
number of documents is in the range of 0.10 to 0.25. 

4. A method, operating in a digital computer, for searching 
a corpus of documents, comprising the steps of: 

preparing an initial structuring of the corpus into a plu-
rality of primary overlapping clusters, wherein at least
two of the plurality of primary overlapping clusters 
contain a single document, wherein the step of prepar- 
ing an initial structuring of the corpus includes the steps 

of

(a) identifying an attractor for each of a plurality of 
clusters, 

(b) including in each of the plurality of clusters all 
documents for which the cluster produces the closest 

attractor, and

(c) adding to the cluster those additional documents for 
which a cosine of the document with respect to the 
attractor of the cluster is at least equal to the largest
cosine for the document minus a predefined value; 

determining a summary of the plurality of primary over- 
lapping clusters prepared by said initial structuring of 
the corpus. 

5. The method of claim 4, wherein the predefined value is 
in the range of 0.05 to 0.10.

6. The method of claim 4, wherein the number of addi- 
tional documents added to the cluster is limited to a prede- 
termined fraction of the number of documents for which the 
cluster produces the closest attractor. 

7. The method of claim 6, wherein the predetermined
fraction of the number of documents is in the range of 0.10
to 0.25. 

8. A method, operating in a digital computer, for searching 
a corpus of documents, comprising the steps of: 
preparing an initial structuring of the corpus into a plu-
rality of primary overlapping clusters, wherein at least
two of the plurality of primary overlapping clusters 
contain a single document, wherein the step of defining
an inner cluster associated with at least one of the 
plurality of primary overlapping clusters, includes the 
steps of 

(a) retrieving, from a first memory, data representing a 
summary of the plurality of primary overlapping 
clusters prepared by said initial structuring,

(b) identifying, from the summary data, all documents 
appearing in more than one of said primary overlap-
ping clusters,

(c) modifying the summary data by eliminating those 
documents identified in step (b) from the summary 
data for each cluster in which it appears, and 

(d) storing the modified summary data in a second 
memory section so as to represent inner clusters 
consisting only of documents that appear in a single
cluster;

determining a summary of the plurality of primary over- 
lapping clusters prepared by said initial structuring of 
the corpus; and 

defining an inner cluster associated with at least one of the 
plurality of primary overlapping clusters, said inner 
cluster consisting of documents found only in the at 
least one of the plurality of primary overlapping clus- 
ters. 

9. The method of claim 8, further including the steps of:

(a) identifying all documents, in the primary overlapping 
cluster, for which a corresponding attractor is the
closest attractor; and 

(b) defining as an intermediate cluster also associated with
at least one of the plurality of primary overlapping 
clusters all documents identified in step (a). 

10. A document search and retrieval method operating in 
a digital computer, for searching a corpus of documents, 
comprising the steps of: 

identifying, in response to at least one user specified 
search term, a sub-corpus of documents containing the 
at least one user specified search term; 

preparing an initial structuring of the sub-corpus into a 
plurality of primary overlapping clusters, wherein at 
least two of the plurality of primary overlapping clus- 
ters contain a single document, wherein the step of 
preparing an initial structuring of the corpus includes 
the steps of 
(a) identifying an attractor for each of a plurality of
primary overlapping clusters,

(b) including in each of the plurality of primary over- 
lapping clusters all documents for which the primary 
overlapping cluster produces the closest attractor, 
and 

(c) adding to the primary overlapping cluster those 
additional documents numbering a predefined num- 
ber of documents included in step (b), for which the 
primary overlapping cluster produces a next closest 
attractor; and 

determining a summary of the plurality of primary over-
lapping clusters prepared by said initial structuring of
the sub-corpus. 

11. A document search and retrieval method, operating in 
a digital computer, for searching a corpus of documents,
comprising the steps of: 

identifying, in response to at least one user specified 
search term, a sub-corpus of documents containing the 
at least one user specified search term; 

preparing an initial structuring of the sub-corpus into a 
plurality of primary overlapping clusters, wherein at 
least two of the plurality of primary overlapping clus- 
ters contain a single document, wherein the step of 
preparing an initial structuring of the corpus includes 
the steps of 

(a) identifying an attractor for each of a plurality of 
primary overlapping clusters,

(b) including in each of the plurality of primary over- 
lapping clusters all documents for which the primary 
overlapping cluster produces the closest attractor, 
and 

(c) adding to the primary overlapping cluster those 
additional documents for which a cosine of the 
document with respect to the attractor of the primary 
overlapping cluster is at least equal to the largest 
cosine for the document minus a predefined value; 
and 

determining a summary of the plurality of primary over- 
lapping clusters prepared by said initial structuring of 
the sub-corpus. 